289 research outputs found
Rank-Aware Subspace Clustering for Structured Datasets
In online applications such as Yahoo! Personals and Trulia.com users define structured profiles in order to find potentially interesting matches. Typically, profiles are evaluated against large datasets and produce thousands of matches. In addition to filtering, users also specify ranking in their profile, and matches are returned in the form of a ranked list. Top results in ranked lists are typically homogeneous, which hinders data exploration. For example, a user looking for 1- or 2-bedroom apartments sorted by price will see a large number of cheap 1-bedrooms in undesirable neighborhoods before seeing any apartment with different characteristics. An alternative to ranking is to group matches on common attribute values (e.g., cheap 1-bedrooms in good neighborhoods, 2-bedrooms with 2 baths). However, not all groups will be of interest to the user given the ranking criteria. We argue here that neither single-list ranking nor attribute-based grouping is adequate for effective exploration of ranked datasets. We formalize rank-aware clustering and develop a novel rank-aware bottom-up subspace clustering algorithm. We evaluate the performance of our algorithm over large datasets from a leading online dating site, and present an experimental evaluation of its effectiveness
Testing Interestingness Measures in Practice: A Large-Scale Analysis of Buying Patterns
Understanding customer buying patterns is of great interest to the retail
industry and has been shown to benefit a wide variety of goals ranging from managing
stocks to implementing loyalty programs. Association rule mining is a common
technique for extracting correlations such as "people in the South of France
buy rosé wine" or "customers who buy paté also buy salted butter and sour
bread." Unfortunately, sifting through a high number of buying patterns is not
useful in practice, because of the predominance of popular products in the top
rules. As a result, a number of "interestingness" measures (over 30) have been
proposed to rank rules. However, there is no agreement on which measures are
more appropriate for retail data. Moreover, since pattern mining algorithms
output thousands of association rules for each product, the ability of an
analyst to rely on ranking measures to identify the most interesting ones is
crucial. In this paper, we develop CAPA (Comparative Analysis of PAtterns), a
framework that provides analysts with the ability to compare the outcome of
interestingness measures applied to buying patterns in the retail industry. We
report on how we used CAPA to compare 34 measures applied to over 1,800 stores
of Intermarché, one of the largest food retailers in France
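As a rough illustration of what an interestingness measure computes, the sketch below scores a single association rule with three classic measures: support, confidence, and lift. It is not CAPA and does not reproduce the 34 measures compared in the paper; the function name `rule_measures` and the toy baskets are assumptions made purely for this example.

```python
from typing import Dict, FrozenSet, List


def rule_measures(
    transactions: List[FrozenSet[str]],
    antecedent: FrozenSet[str],
    consequent: FrozenSet[str],
) -> Dict[str, float]:
    """Score the rule antecedent -> consequent (hypothetical helper)."""
    n = len(transactions)
    # Count baskets containing the antecedent, the consequent, and both.
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_c = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)

    support = n_both / n
    confidence = n_both / n_a if n_a else 0.0
    # Lift compares the confidence to the consequent's baseline frequency.
    lift = confidence / (n_c / n) if n_c else 0.0
    return {"support": support, "confidence": confidence, "lift": lift}


# Toy data, invented for illustration only.
baskets = [
    frozenset({"pate", "butter", "bread"}),
    frozenset({"pate", "butter"}),
    frozenset({"bread", "milk"}),
    frozenset({"milk"}),
]
m = rule_measures(baskets, frozenset({"pate"}), frozenset({"butter"}))
```

Different measures can rank the same rules differently (a popular consequent inflates confidence but not lift), which is the kind of disagreement a comparison framework has to surface.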
Personalizing XML Full Text Search in PIMENTO
In PIMENTO we advocate a novel approach to XML search that leverages user information
to return more relevant query answers. This approach is based on formalizing
user profiles in terms of scoping rules, which are used to rewrite an input query,
and of ordering rules, which are combined with query scoring to customize the ranking
of query answers to specific users
Crowd4U: An Initiative for Constructing an Open Academic Crowdsourcing Network
We describe the Crowd4U initiative, which aims at constructing an all-academic, open, and generic platform for microvolunteering and crowdsourcing worldwide. Crowd4U provides a microtask-based platform in which most workers are volunteers at universities and other research institutions. Crowd4U is open in the sense that the platform can interact with other platforms, researchers can register their tasks, and the underlying code is not a black box. It is generic in that virtually any task can be registered. Crowd4U has already been used by several projects for public and academic purposes
Distributed Evaluation of Top-k Temporal Joins
To appear in SIGMOD'16. We study a particular kind of join, coined Ranked Temporal Join (RTJ), featuring predicates that compare time intervals and a scoring function associated with each predicate to quantify how well it is satisfied. RTJ queries are prevalent in a variety of applications such as network traffic monitoring, task scheduling, and tweet analysis. RTJ queries are often best interpreted as top-k queries where only the best matches are returned. We show how to exploit the nature of temporal predicates and the properties of their associated scoring semantics to design TKIJ, an efficient query evaluation approach on a distributed Map-Reduce architecture. TKIJ relies on an offline statistics computation that, given a time partitioning into granules, computes the distribution of intervals' endpoints in each granule, and an online computation that generates query-dependent score bounds. These statistics are used to assign workload to reducers, with the aim of reducing data replication and thus limiting I/O cost. Additionally, high-scoring results are distributed evenly to enable each reducer to prune unnecessary results. Our extensive experiments on synthetic and real datasets show that TKIJ outperforms state-of-the-art competitors and provides very good performance for n-ary RTJ queries on temporal data
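The offline statistics step mentioned in this abstract can be pictured with a small sketch that counts, for each granule, how many intervals start and end in it. This is only an illustration under stated assumptions: granules are taken to be fixed-width (the abstract does not say how time is partitioned), and the name `endpoint_histogram` is hypothetical, not part of TKIJ.

```python
from collections import Counter
from typing import Iterable, Tuple


def endpoint_histogram(
    intervals: Iterable[Tuple[float, float]],
    granule_width: float,
) -> Tuple[Counter, Counter]:
    """Count interval start and end points per granule.

    Assumption: granule i covers [i * granule_width, (i + 1) * granule_width).
    """
    starts: Counter = Counter()
    ends: Counter = Counter()
    for start, end in intervals:
        starts[int(start // granule_width)] += 1
        ends[int(end // granule_width)] += 1
    return starts, ends


# Toy intervals, invented for illustration only.
starts, ends = endpoint_histogram(
    [(0, 5), (3, 12), (11, 19), (20, 21)], granule_width=10
)
```

Histograms of this kind give an estimate of how many intervals each pair of granules can contribute to a join, which is the sort of information a workload assignment step could draw on.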
Profile Diversity for Phenotyping Data Search and Recommendation
Session: Innovative Applications. In this work, we study profile diversity, a novel approach to searching scientific documents. Many works have combined keyword relevance with document popularity within a "social" scoring function. Diversifying the content of returned documents has also been studied in depth in search, advertising, database queries, and recommendation. We believe our work is the first to address profile diversity in order to tackle the problem of result lists that are highly popular but too narrowly focused. We show how we adapt Fagin's threshold algorithms to return the documents that are the most relevant and the most popular, but also the most diverse, in terms of either content or profiles. We also report a set of simulations on two benchmarks to validate our scoring function
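For readers unfamiliar with threshold algorithms, the sketch below shows the classic Threshold Algorithm (TA) for top-k retrieval over several score lists, with a plain sum as the aggregation function. It is not the adaptation described in this abstract: treating diversity as an independent per-document score is a simplifying assumption made here, and the name `threshold_topk` and all scores are invented for illustration.

```python
import heapq
from typing import Dict, List, Tuple


def threshold_topk(lists: List[Dict[str, float]], k: int) -> List[Tuple[str, float]]:
    """Classic TA with sum aggregation; missing scores count as 0.0."""
    # Sorted access: each list ordered by descending score.
    sorted_lists = [sorted(l.items(), key=lambda kv: -kv[1]) for l in lists]
    seen: Dict[str, float] = {}
    max_depth = max(len(s) for s in sorted_lists)

    for depth in range(max_depth):
        threshold = 0.0
        for s in sorted_lists:
            if depth < len(s):
                doc, score = s[depth]
                threshold += score
                if doc not in seen:
                    # Random access: fetch the document's score in every list.
                    seen[doc] = sum(l.get(doc, 0.0) for l in lists)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # Stop once k seen documents score at least the threshold:
        # no unseen document can beat them.
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])


# Hypothetical per-document scores.
relevance = {"d1": 0.9, "d2": 0.8, "d3": 0.1, "d4": 0.3}
popularity = {"d1": 0.2, "d2": 0.7, "d3": 0.9, "d4": 0.1}
diversity = {"d1": 0.5, "d2": 0.6, "d3": 0.4, "d4": 0.2}
result = threshold_topk([relevance, popularity, diversity], k=2)
top_ids = [doc for doc, _ in result]
```

On this toy input the scan stops at the third position of each list, as soon as the second-best aggregated score reaches the threshold.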
Exploration of User Groups in VEXUS
We introduce VEXUS, an interactive visualization framework for exploring user
data to fulfill tasks such as finding a set of experts, forming discussion
groups and analyzing collective behaviors. User data is characterized by a
combination of demographics like age and occupation, and actions such as rating
a movie, writing a paper, following a medical treatment or buying groceries.
The ubiquity of user data requires tools that help explorers, be they
specialists or novice users, acquire new insights. VEXUS lets explorers
interact with user data via visual primitives and builds an exploration profile
to recommend the next exploration steps. VEXUS combines state-of-the-art
visualization techniques with appropriate indexing of user data to provide fast
and relevant exploration